White wine analysis by Ilaria Tavecchia

Information about the data

This tidy dataset contains 4,898 white wines, each described by 11 variables that quantify its chemical properties. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (excellent).

Univariate Plots Section

First of all, let’s have a look at the data. All columns are numeric, including the quality column that contains the wine evaluation. We have added a new column called quality_cat that summarises quality in 3 values: Low for quality of 4 or below, Medium for quality between 5 and 7, and High for quality of 8 or above. Later in the analysis we will focus on Low and High only, since they contain roughly the same number of wines (183 and 180). Some descriptive statistics are shown below.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality      quality_cat  
##  Min.   :3.000   Low   : 183  
##  1st Qu.:5.000   Medium:4535  
##  Median :6.000   High  : 180  
##  Mean   :5.878                
##  3rd Qu.:6.000                
##  Max.   :9.000

Next, let’s look at the distribution of each variable.

The first distributions we analyse are those of volatile acidity, citric acid and residual sugar.

For the first two, some values lie far out in the tail. This is why we have added a boxplot for both volatile acidity and citric acid, to better investigate outliers. A boxplot can reveal information that is harder to find in a histogram: with this plot type it is much easier to get an idea of how many outliers lie in each tail.
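As a minimal sketch of how the points drawn beyond the boxplot whiskers can be counted, base R's `boxplot.stats()` can be applied to a single column. The values below are made up for illustration and only stand in for a column such as volatile.acidity.

```r
# Made-up illustrative values, NOT taken from the wine data.
x <- c(0.21, 0.24, 0.26, 0.27, 0.28, 0.30, 0.32, 0.34, 0.95, 1.10)

# boxplot.stats() returns in $out the points lying beyond the whiskers
# (by default more than 1.5 times the box length away from the box).
out <- boxplot.stats(x)$out

# Number of flagged points and their values.
length(out)
out
```

The same points are the ones `boxplot(x)` draws individually, which is why the boxplot makes them easier to count than a histogram does.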

Back to the distribution analysis: here are chlorides, free sulfur dioxide and total sulfur dioxide.

A closer look at the outliers in chlorides and free sulfur dioxide:

The next group of distributions covers density, pH and sulphates.

Let’s have a better look at the outliers for pH and sulphates.

Finally, the last distributions cover alcohol, quality and the new variable quality_cat. No outliers are identified here.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 11 features, all numerical, and one variable called quality that represents how good or bad each wine is. A new categorical variable, quality_cat, has been added to the dataset to differentiate between Low, Medium and High quality instead of using the full range of observed scores (3 to 9).

What is/are the main feature(s) of interest in your dataset?

From a first analysis, several variables seem of interest: alcohol level, residual sugar and volatile acidity all show interesting patterns.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

In this first part we have only analysed single variables. In the bivariate analysis we might find interesting patterns between variables that could not be identified here.

Did you create any new variables from existing variables in the dataset?

Yes, quality_cat is a new variable derived from quality. It takes the value:

  • “Low” if quality is less than or equal to 4,

  • “Medium” if quality is between 5 and 7 inclusive,

  • “High” if quality is greater than or equal to 8.
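The three rules above can be sketched with base R's `cut()`. This is one possible way to derive the column, not necessarily the code used for this report, and the scores below are an illustrative sample rather than the real data.

```r
# Illustrative quality scores covering the observed range 3 to 9.
quality <- c(3, 4, 5, 6, 7, 8, 9)

# cut() uses right-closed intervals by default, so the breaks below give
# (-Inf, 4] -> Low, (4, 7] -> Medium, (7, Inf) -> High.
quality_cat <- cut(
  quality,
  breaks = c(-Inf, 4, 7, Inf),
  labels = c("Low", "Medium", "High"),
  ordered_result = TRUE
)

# Count how many scores fall in each category.
table(quality_cat)
```

Using an ordered factor matches the `<ord>` column type shown for quality_cat in the grouped tables of this report.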

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Yes, the histograms show that several distributions have outliers. For each of those we plotted the histogram again next to a boxplot, which makes identifying outliers much easier than with a histogram alone.

Bivariate Plots Section

The plot below shows the distribution of alcohol level for Low, Medium and High quality wines. As we can see, the majority of wines are of medium quality, while few are very good or very bad.

The plot below shows how many wines each category contains.

To get a better idea of the differences between quality categories, we have drawn some boxplots. The idea is to look for differences in mean and median between categories for each variable. Let’s see if this is the case.

The first boxplot below shows the alcohol distribution: there is quite a difference in mean and median alcohol level between High and Low/Medium wines.

Volatile acidity boxplot: here the difference between High and Low is less clear for the median, but Low wines have a higher mean than High ones. The distribution for Low wines also sits higher overall than for High wines. Several outliers are present for Medium wines.

For fixed acidity and density, it is harder to identify a clear difference in mean and median between quality categories.

The last variable we analyse is residual.sugar. Here we can note differences in the distribution and median values across all categories.

Next, let’s create a plot that contains every combination of variables, with their scatterplot and their correlation.
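The correlations behind such an overview plot can be sketched with base R's `cor()`, which returns the pairwise Pearson correlation matrix of a numeric data frame. The toy data frame below uses made-up values and only borrows three column names from the dataset.

```r
# Made-up toy values, NOT the real wine measurements.
wine <- data.frame(
  density        = c(0.991, 0.993, 0.995, 0.997, 0.999),
  residual.sugar = c(1.5, 4.0, 7.5, 11.0, 15.5),
  alcohol        = c(12.5, 11.4, 10.6, 9.8, 9.0)
)

# Pairwise Pearson correlations, rounded for readability.
corr <- round(cor(wine), 2)
corr
```

Base R's `pairs(wine)` draws the matching grid of scatterplots; the report's own plot may have been produced with a different package.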

From the plot above, we decided to focus on the combinations of variables with the highest correlation, which we analyse below.

Let’s start with density and residual sugar: in the plot below the trend is shown by the blue line.
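A trend line of this kind can be sketched as a least-squares fit with `lm()`. Whether the report's blue line is a linear fit or another smoother is an assumption here, as is the choice of residual sugar for the x axis; the values are made up for illustration.

```r
# Made-up toy values, NOT the real wine measurements.
sugar   <- c(1.5, 4.0, 7.5, 11.0, 15.5)
density <- c(0.991, 0.993, 0.995, 0.997, 0.999)

# Least-squares line: density as a linear function of residual sugar.
fit <- lm(density ~ sugar)
slope <- coef(fit)[["sugar"]]

# Scatterplot with the fitted line drawn in blue.
plot(sugar, density, xlab = "residual sugar", ylab = "density")
abline(fit, col = "blue")
```

A positive `slope` corresponds to the positive correlation between the two variables.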

In the plot of alcohol vs density we can also see a linear trend, which helps explain why the two are negatively correlated. No particular transformation has been applied; only some outliers were removed.

For total vs free sulfur dioxide we can also see a relationship. Here a square root transformation has been applied to the x axis, and some outliers have been removed from the visualization.
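A minimal sketch of these two adjustments in base R follows. The report does not say which variable is on the x axis or how outliers were chosen, so both the axis assignment and the 90th-percentile cut-off below are assumptions, and the values are made up.

```r
# Made-up toy values standing in for the two sulfur dioxide columns.
free_so2  <- c(5, 12, 20, 28, 35, 44, 52, 60, 75, 289)
total_so2 <- c(40, 75, 100, 118, 134, 150, 165, 180, 210, 440)

# Assumption: hide points above the 90th percentile of the x variable.
keep <- free_so2 <= quantile(free_so2, 0.90)

# Square root transformation applied to x before plotting.
plot(
  sqrt(free_so2[keep]), total_so2[keep],
  xlab = "sqrt(free.sulfur.dioxide)", ylab = "total.sulfur.dioxide"
)
```

The rows are only hidden from the plot; the underlying data stay unchanged.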

In this plot too we can clearly see a linear trend, confirming what we saw in the initial plot and the correlation value.

In the next step of our analysis we consider some variables together. We start with alcohol and residual sugar: we group wines by unit of alcohol (8, 9, 10, etc.) and then calculate the mean and median residual sugar for each group.

# A tibble: 6 × 4
  `round(alcohol)` sugar_mean sugar_median     n
             <dbl>      <dbl>        <dbl> <int>
1                9  10.456198        11.10  1194
2               10   6.239358         6.00  1527
3               11   4.358801         2.50  1034
4               12   4.300194         3.00   774
5               13   3.829140         2.85   314
6               14   3.335366         2.60    41
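A summary like the one above can be sketched in base R with `tapply()`. The tibble header suggests the report itself used dplyr, so this is an equivalent sketch rather than the original code, and the values are made up for illustration.

```r
# Made-up toy values, NOT the real wine measurements.
alcohol <- c(8.8, 9.2, 9.4, 10.1, 10.3, 10.6, 11.2)
sugar   <- c(12.0, 10.0, 11.0, 6.0, 7.0, 2.0, 3.0)

# Group key: alcohol rounded to the nearest unit.
unit <- round(alcohol)

# One row per alcohol unit with the mean, median and count of sugar values.
by_unit <- data.frame(
  alcohol_unit = sort(unique(unit)),
  sugar_mean   = as.vector(tapply(sugar, unit, mean)),
  sugar_median = as.vector(tapply(sugar, unit, median)),
  n            = as.vector(tapply(sugar, unit, length))
)
by_unit
```

Note that `round()` in R follows the IEC 60559 convention for values exactly halfway between two integers, so `round(10.5)` gives 10 rather than 11.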

The plot below shows the mean residual sugar for each alcohol group. It clearly shows a peak in mean sugar at 9 degrees of alcohol, which then decreases as the alcohol increases.

Below is the scatterplot of sugar vs alcohol with its trend.

In another analysis, we group wines by density to calculate the mean and median residual sugar.

# A tibble: 6 × 4
  `round(density, digit = 3)` sugar_mean sugar_median     n
                        <dbl>      <dbl>        <dbl> <int>
1                       1.000   16.78307        17.20   127
2                       1.001   18.18250        18.35    20
3                       1.002   18.65833        17.55     6
4                       1.003   26.05000        26.05     2
5                       1.010   31.60000        31.60     2
6                       1.039   65.80000        65.80     1

This plot represents density vs sugar mean value. A clear linear trend is identified here, confirming what we saw in the scatterplot above.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

We started our bivariate analysis by computing the distributions, plots and correlations for all combinations of variables. Since we did not know much about this dataset in advance, this helped us decide which features to focus on first. In particular, we analysed:

  • density and residual sugar

  • density and alcohol

  • total.sulfur.dioxide and free.sulfur.dioxide

  • density and total.sulfur.dioxide.

For all of these we found a linear trend that was easy to identify after some small adjustments to the plot.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

We also analysed alcohol and sugar values: we grouped alcohol levels by unit and calculated the mean and median residual sugar for each. The plot clearly shows a relationship between the two variables: the higher the level of alcohol, the lower the sugar level, with sugar peaking at around 9 degrees of alcohol. This confirmed what we knew about alcohol and sugar levels.

What was the strongest relationship you found?

Between density and residual sugar we found a strong correlation of about 0.8, confirmed both by the scatterplot and by the grouped analysis.

Multivariate Plots Section

In this analysis we take the main variables considered before and plot them against their quality category. We are particularly interested in whether there is a clear division between High and Low wines in each plot. In all the plots below we have omitted the Medium category, showing low-quality points in yellow and high-quality points in orange.
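Dropping the Medium category before plotting can be sketched in base R as follows. The data frame is a made-up stand-in, and the colour mapping simply mirrors the yellow and orange convention described above.

```r
# Made-up toy rows, NOT the real wine measurements.
wine <- data.frame(
  density = c(0.991, 0.994, 0.996, 0.992, 0.998),
  alcohol = c(12.8, 10.5, 9.4, 12.1, 9.1),
  quality_cat = factor(
    c("High", "Medium", "Low", "High", "Low"),
    levels = c("Low", "Medium", "High"), ordered = TRUE
  )
)

# Keep only Low and High wines and drop the now-unused Medium level.
low_high <- droplevels(subset(wine, quality_cat != "Medium"))

# Yellow for Low, orange for High.
cols <- c(Low = "yellow", High = "orange")[as.character(low_high$quality_cat)]
plot(low_high$alcohol, low_high$density, col = cols, pch = 19,
     xlab = "alcohol", ylab = "density")
```

Calling `droplevels()` keeps the Medium level from reappearing as an empty group in later summaries or legends.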

First let’s look at density vs residual sugar. We can see that the two categories could be separated almost linearly:

In the next plot we focus on density vs alcohol level. As before, we can clearly separate High and Low quality wines. Here a higher quality means a higher level of alcohol and a lower density.

In the plot below we analyse total vs free sulfur dioxide. It is again possible to separate higher from lower quality, though not linearly this time.

Looking at fixed.acidity vs pH shows that we cannot always separate low from high quality. In the plot below it is hard to define a rule: many points in the middle belong to both low and high quality wines.

Another interesting plot is density vs total sulfur dioxide. Here we could separate the two categories with a hyperplane: high quality wines usually have a medium level of sulfur dioxide and a density that is not too high.

Let’s now repeat an analysis we did before. Here we focus on density and mean residual sugar, this time also taking quality into account.

# A tibble: 6 × 5
  `round(density, digit = 3)` quality_cat sugar_mean sugar_median     n
                        <dbl>       <ord>      <dbl>        <dbl> <int>
1                       0.989         Low   1.050000        1.050     2
2                       0.990         Low   1.268750        1.050     8
3                       0.991         Low   2.700000        1.875    18
4                       0.992         Low   1.537500        1.100    16
5                       0.993         Low   2.911111        1.600    27
6                       0.994         Low   3.053333        1.750    30

As we can see from the plot below, the mean sugar values differ between low and high quality.

We have also grouped by alcohol, again taking quality into account.

# A tibble: 6 × 5
  `round(alcohol)` quality_cat sugar_mean sugar_median     n
             <dbl>       <ord>      <dbl>        <dbl> <int>
1                8         Low   4.200000         5.10     3
2                9         Low   7.584694         7.60    49
3               10         Low   3.947368         2.00    76
4               11         Low   3.580303         1.60    33
5               12         Low   3.520588         3.50    17
6               13         Low   4.725000         4.85     4

Here as well, we can see a difference between the mean values for high and low quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

In this part of the analysis we focused mostly on the relationship between Low and High quality wines. Since those two categories contain a similar amount of data, we considered only them, trying to find a clear pattern that distinguishes the best wines from the worst. It is clear that some variables play an important role in this distinction. Several plots, like density vs alcohol or density vs residual sugar, show how clearly we can separate high from low quality.

Were there any interesting or surprising interactions between features?

It was interesting to see that some variables separate the two categories better than others. It was also interesting to confirm some trends that we had already noticed in the bivariate analysis.


Final Plots and Summary

Plot One

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.00    9.50   10.40   10.51   11.40   14.20 

Description One

This plot is a boxplot of alcohol for each quality category: Low, Medium and High. It is very informative because we can see the distribution for each class, its mean and median, and whether outliers are present. It also shows a clear difference between high quality wines and low/medium ones.

Plot Two

Description Two

This scatterplot shows the relationship between density and residual sugar. It is very informative because we can clearly divide the two classes of wine, Low and High, with a simple line. It also hints at the possible use of a classifier: an SVM could be used to find a hyperplane that divides the two classes.

Plot Three

Description Three

The third graph is a line plot of average residual sugar per unit of alcohol. Here we plot only high and low quality wines, and we can see a difference between the two types up to 13 degrees of alcohol.


Reflection

It was very interesting to analyse this dataset. I like drinking wine when I hang out with friends, and I can tell whether a wine is good or not. Until now I was not really aware of all the chemical properties found in wine and how they play a role in its quality. It was complicated to get started, since I didn’t know much about this dataset. Once I got more familiar with it, I could find more interesting things. Now I have a better idea of what makes a good wine rather than a bad one. In chemical terms, I can definitely find a correlation between density and each of alcohol, residual sugar and total.sulfur.dioxide. The relationship between alcohol and residual sugar also shows a clear pattern.

For future work, it would be interesting to explore the chemical properties further, but we could add more value with more general information about each wine, such as the type of wine, where it is grown, the temperature, etc. That way the analysis would be more complete, focusing not just on chemical compounds that not everyone is familiar with, but also on more tangible facts.